Introduction

library(tidyverse)
demographics = read.csv('data/demos_anonymized.csv')
ids = read.csv('data/ids_anonymized.csv')
model_variables = read.csv('data/model_variables_anonymized.csv')

Get a Quick Summary

It’s always a good idea to glance at the data before working with it.

glimpse(demographics)
summary(demographics[,1:10])

Grouping and Summarizing Operations

A very common operation is to do things by group(s) then create new summary variables.

demographics %>% 
  group_by(libuser) %>% 
  summarise(age = mean(age, na.rm = TRUE))

Grouping and Summarizing Operations

Do multiple operations at once

demographics %>% 
  group_by(libuser) %>% 
  summarise(age_mean = mean(age, na.rm = TRUE),
            age_sd = sd(age, na.rm = TRUE),
            age_max = max(age, na.rm = TRUE),
            prop_male = mean(gender == 'Male', na.rm = TRUE))

Mapping functions to groups

We’ll use the map function (from purrr, loaded with the tidyverse) to apply the sum function to each element of a list.

x = list(1:3, 4:6, 7:9)
map(x, sum)

Grouping and Summarizing Operations

We can even do this with modeling and other operations.

This extracts the coefficients from a model run for each group.

demographics %>% 
  drop_na(race) %>% 
  group_split(race) %>%   
  map(~ lm(award_total_amount ~ gender, data = .x)) %>%  # fit one model per race group
  map(coef)                                              # extract each model's coefficients

Visualization

Visualizing data requires that you:
- Consider carefully the information you want to display
- And then how you want to display it

Tell a story with the data.

And have some fun with it!

ggplot2

The most widely used visualization package in R

Layers

Visualization can be thought of in a layered fashion

Start with the base, then build up

More pipes

ggplot2 uses +, much like a pipe, to add layers

library(ggplot2)
ggplot(data, aes(x = var1, y = var2)) +  # aesthetics set here are shared by all layers
  geom_point() +
  geom_line() +
  theme(plot.caption = element_text(size = 6))

We can also pipe data into any ggplot call with %>%, as we did before

Aesthetics

Aesthetics (aes) map variables to visual properties

Geoms are the geometric units we want to display

Aesthetics

model_variables %>% 
  filter(award_total_amount > 1e7) %>% 
  ggplot() +
  geom_density(aes(x = award_total_amount, 
                   color = factor(gender),
                   fill = factor(gender)),
               alpha = .2) + 
  scale_x_continuous(breaks = (1:10) * 1e7, trans = 'log')

demographics %>% 
  filter(award_year_start < 2020 & award_year_start > 1990) %>%  
  ggplot(aes(award_year_start, award_total_log)) + 
  geom_smooth(aes(color=factor(libuser)))

Stats

We can also use ggplot2 to create statistics we want to visualize

Typically used indirectly when geoms are called

Can be used for more direct control

ggplot(model_variables, aes(age, award_total_amount)) +
  geom_point(alpha = .02) +
  stat_ellipse(color = '#ff5500')

Scales

Scales are used to add specifications to axes, colors, etc.

model_variables %>% 
  filter(award_total_amount >= 1e6) %>% 
  ggplot() +
  geom_density(aes(x = award_total_amount, 
                   color = gender,
                   fill = gender),
               alpha = .2) + 
  scale_x_continuous(breaks = c(1e6, 5e6, 1e7, 2.5e7, 5e7), 
                     trans = 'log') +
  scale_fill_viridis_d(begin = .25, end = .5) +
  scale_color_viridis_d(begin = .25, end = .5) 

Facets

Facets add another dimension to a plot by splitting it into panels by group

model_variables %>% 
  ggplot() +
  geom_density(aes(x = award_total_amount, 
                   color = libuser,
                   fill = libuser),
               alpha = .2) +
  facet_wrap(~gender)

Themes

Themes allow for customization

Two ways to use themes:
- built-in versions (e.g. theme_minimal)
- DIY (theme(…))

For the theme function, each argument takes a specific value or an element function (e.g. element_text, element_rect, element_blank).

Themes

model_variables %>% 
  ggplot() +
  geom_smooth(aes(x = age,
                y = award_total_amount,
                color = libuser),
               alpha = .2) + 
  theme_minimal()

model_variables %>% 
  ggplot() +
  geom_smooth(aes(x = age,
                y = award_total_amount,
                color = libuser),
               alpha = .2) + 
  theme(axis.text.x = element_text(size=12),
        panel.grid.minor.x = element_blank(),
        plot.background = element_rect(color = 'rosybrown'),
        panel.background = element_rect(fill = 'papayawhip'))

Interactivity

Interactivity is a must-have tool for web-based presentation

Use it to enhance exploration of the data, not just because one can

Allows for additional dimensions

Even useful for exploring raw data

Interactivity

General

Interactivity

Specific functionality:

Plotly

traces - the add_* functions, which work similarly to geoms

modes - allow for points, lines, text, and combinations

aesthetics - variables are denoted with ~, constants are not (x = ~var1 vs. x = 2)

Plotly

Plotly uses the standard pipe %>%

library(plotly)

model_variables %>% 
  plot_ly(x = ~gender, y = ~ age) %>% 
  add_boxplot(color =~ gender)

Plotly

library(plotly)

model_variables %>% 
  group_by(libuser, gender) %>% 
  summarise(award = mean(award_total_amount)) %>% 
  plot_ly(x = ~libuser, 
          y = ~award, 
          color = ~gender,
          text = ~round(award), 
          textposition = 'auto', 
          type = 'bar') %>% 
  layout(bargap = 0.25, 
         bargroupgap = 0.25)

Plotly

init = glm(award_total_amount >= 5000000 ~ age*libuser, 
           data = model_variables, 
           family = binomial)

model_variables %>% 
  modelr::add_predictions(init, type = 'response') %>% 
  plot_ly(x = ~age, y = ~ pred) %>% 
  add_lines(color =~ libuser, line = list(shape = "spline")) %>% 
  layout(title = 'Predicted Prob. Award >= 5 mil')

Plotly

Use ggplotly to turn our formerly static plots into interactive ones.

p = model_variables %>% 
  ggplot() +
  geom_density(aes(x = log(award_total_amount), 
                   color = libuser,
                   fill = libuser),
               alpha = .2) + 
  facet_wrap(~ gender) 
ggplotly(p)

Python examples

Python has come a long way in terms of data processing. It is often claimed to be faster and less memory-hungry than R, but this depends on many factors and does not hold in general.

Init

# note how when using something other than R, you have to specify the engine path
import pandas as pd
import numpy as np

demographics = pd.read_csv('data/demos_anonymized.csv')
ids = pd.read_csv('data/ids_anonymized.csv')
model_variables = pd.read_csv('data/model_variables_anonymized.csv')

Grouping and Summarizing data

demographics.describe(include = 'all')
demographics.describe(include = [np.number])
demographics.describe(include = [object])  # np.object was removed from NumPy; use the builtin object

lib_group = demographics.groupby('libuser', sort=True)

# restrict to numeric columns (recent pandas versions no longer drop the others automatically)
lib_group.mean(numeric_only=True)
lib_group.get_group(0).head()
lib_group.size()
lib_group.describe()
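
The multi-summary shown earlier in R has a close pandas analogue in named aggregation. What follows is a minimal sketch under stated assumptions: the `toy` data frame and its values are invented for illustration and are not the workshop data; only the column names `libuser`, `age`, and `gender` are borrowed from `demographics`.

```python
import pandas as pd

# Toy data, invented for illustration; not the workshop's demographics file
toy = pd.DataFrame({
    'libuser': [0, 0, 1, 1, 1],
    'age':     [30.0, 40.0, 50.0, 60.0, None],
    'gender':  ['Male', 'Female', 'Male', 'Male', 'Female'],
})

# Precompute an indicator so that its group mean is the proportion male
toy['is_male'] = toy['gender'] == 'Male'

# Named aggregation: new_column = (source_column, function)
summary = toy.groupby('libuser').agg(
    age_mean=('age', 'mean'),      # missing ages are skipped by default
    age_sd=('age', 'std'),
    age_max=('age', 'max'),
    prop_male=('is_male', 'mean'),
)
print(summary)
```

One difference from the R version is worth keeping in mind: a missing gender compares as False here, so it counts toward the denominator, whereas `na.rm = TRUE` in R drops it.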

Mapping a function

x = [[1,2,3], [4,5,6], [7,8,9]]
list(map(np.sum, x))
demographics.select_dtypes('number').apply(np.mean)
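
The same mapping idea extends to fitting a model per group, as in the R section. The sketch below is only illustrative: the `toy` data and the helper `fit_line` are hypothetical, and `np.polyfit` is swapped in as a simple stand-in for a linear model rather than being the workshop's method.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'x':     [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    'y':     [1.0, 3.0, 5.0, 2.0, 1.0, 0.0],
})

def fit_line(df):
    # Degree-1 least squares fit; polyfit returns [slope, intercept]
    slope, intercept = np.polyfit(df['x'], df['y'], deg=1)
    return {'intercept': intercept, 'slope': slope}

# Iterating over a groupby yields (group name, sub-frame) pairs,
# so a dict comprehension maps the fitting function over the groups
coefs = {name: fit_line(piece) for name, piece in toy.groupby('group')}
print(coefs)
```

Returning a plain dict per group keeps the result easy to inspect; `pd.DataFrame(coefs).T` would turn it into a table with one row per group.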

Visualization

matplotlib is the most common visualization module in Python, though it’s fairly dated at this point. As such we’ll use a ggplot implementation in Python called plotnine.

Unfortunately, plotly’s interactivity makes it unusable within the R notebook (at present), so you may need to switch to Anaconda or another environment to try modules like plotly. Many Python users still turn to R for easier visualization, so feel free to do your data work in Python, then use ggplot etc. in R when the time comes.

That said, I’ll show a couple of plots.

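from plotnine import ggplot, aes, geom_smooth

# keep awards starting after 1990 and before 2020 with a log award total above 11
dplot = demographics[(demographics.award_year_start < 2020) & (demographics.award_year_start > 1990) & (demographics.award_total_log > 11)]
dplot
ggplot(dplot, aes(x='award_year_start', y='award_total_log')) + geom_smooth(aes(group = 'libuser', color = 'factor(libuser)'))

# average of the numeric columns by department and start year, library users only
year_award_average = demographics[demographics.libuser == 1].groupby(['award_home_dept', 'award_year_start']).mean(numeric_only=True)
year_award_average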

Example boxplot with plotly.

import numpy as np
import plotly.graph_objects as go

# plotly.plotly moved to the separate chart_studio package in plotly 4;
# building a Figure and calling show() renders locally instead
y0 = np.random.randn(50) - 1
y1 = np.random.randn(50) + 1

trace0 = go.Box(y=y0)
trace1 = go.Box(y=y1)
fig = go.Figure(data=[trace0, trace1])
fig.show()